The following exploratory data analysis is an
investigation of the factors influencing the
borrower rate of loans from the Prosper Loan Company
from the second quarter of 2006 through the first
quarter of 2014.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
BorrowerRate ranges from 0 to about 0.50.
## 10%
## 0.09886
## 90%
## 0.3099
About 80% of BorrowerRates fall between 0.099 and 0.310 (the 10th and 90th percentiles).
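The percentile bounds above can be reproduced with a call along these lines. This is only a sketch: `borrower_rate` is a synthetic stand-in for the BorrowerRate column, not the Prosper data.

```r
# Synthetic stand-in for the BorrowerRate column (assumption: not the real data).
set.seed(1)
borrower_rate <- runif(1000, min = 0, max = 0.5)

# 10th and 90th percentiles: the middle 80% of values lies between them.
rate_bounds <- quantile(borrower_rate, probs = c(0.10, 0.90))
print(rate_bounds)

# Share of values inside the bounds (close to 0.8 by construction).
inside <- mean(borrower_rate >= rate_bounds[1] & borrower_rate <= rate_bounds[2])
print(inside)
```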
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
All loans are between $1,000 and $35,000.
75% of loans are for $12,000 or less.
All loans have a term of either 1, 3, or 5 years.
## (blank) Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
The large majority of loans go to individuals whose employment status is either “Employed” or “Full-time”. No surprises here.
## False True
## 56459 57478
About half of the borrowers are homeowners (57,478 of 113,937).
## (blank) AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
There is a pretty even distribution of borrowers among the States, when taking state population into account.
## (blank) Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse (LPN)
## 346 492
## Nurse (RN) Nurse's Aide
## 2489 491
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
## (blank) A AA B C D E HR
## 29084 14551 5372 15581 18345 14274 9795 6935
## (blank) A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
A large number of borrowers have no Prosper rating (29,084) or no credit grade (84,984).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
Most borrowers have a credit score between 600 and 800.
## $0 $100,000+ $1-24,999 $25,000-49,999 $50,000-74,999
## 621 17337 7274 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
Few loans are given out to individuals with low income, or who are unemployed, no surprises here.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Most borrowers have a DebtToIncomeRatio in the range of 0 to 0.5. Those with a ratio of 10 are probably errors.
We can see here that the number of loans given out dropped dramatically in the last quarter of 2008, likely due to the recession.
There are 113,937 loans in the dataset with 81 features, 13 of which were used in the analysis:
BorrowerRate
CreditScoreRangeLower
DebtToIncomeRatio
LoanOriginalAmount
StatedMonthlyIncome
Term: 60, 36, 12
(note that Term was a numerical variable but was transformed to a factor variable because it has very few values)
CreditGrade: AA, A, B, C, D, E, HR, NC, none
ProsperRating: AA, A, B, C, D, E, HR, none
IncomeRange: $100,000+; $75,000-99,999; $50,000-74,999; $25,000-49,999; $1-24,999; $0; Not employed; Not displayed
EmploymentStatus: Employed, Full-time, Not employed, Part-time, Retired, Self-employed, Not available, Other, none
Occupation: Accountant/CPA, Administrative Assistant, Analyst, Architect, Attorney, Biologist, Bus Driver, Car Dealer, Chemist, Civil Service, Clergy, Computer Programmer, Construction, Dentist, Doctor, Engineer - Chemical, Engineer - Electrical, Engineer - Mechanical, Executive, Fireman, Flight Attendant, Food Service, Food Service Management, Homemaker, Investor, Judge, Laborer, Landscaping, Medical Technician, Military Enlisted, Military Officer, Nurse (LPN), Nurse (RN), Nurse’s Aide, Other, Pharmacist, Pilot - Private/Commercial, Police Officer/Correction Officer, Postal Service, Principal, Professional, Professor, Psychologist, Realtor, Religious, Retail Management, Sales - Commission, Sales - Retail, Scientist, Skilled Labor, Social Worker, Student - College Freshman, Student - College Graduate Student, Student - College Junior, Student - College Senior, Student - College Sophomore, Student - Community College, Student - Technical School, Teacher, Teacher’s Aide, Tradesman - Carpenter, Tradesman - Electrician, Tradesman - Mechanic, Tradesman - Plumber, Truck Driver, Waiter/Waitress
BorrowerState: AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY
BorrowerRate ranges from 0 to about 0.50.
About 80% of BorrowerRates are between 0.099 and 0.310.
Loan amounts range from $1,000 to $35,000.
75% of loans are for $12,000 or less.
All loans have a term of either 1, 3, or 5 years.
The large majority of loans go to individuals who are employed or employed full-time.
A large number of borrowers either don’t have a Prosper rating or don’t have a credit grade.
Few loans are given out to individuals with low income, or who are unemployed.
The number of loans given out dropped dramatically in the last quarter of 2008.
The main feature of interest is BorrowerRate: this investigation is primarily concerned with the factors that influence it.
It is hard to say at this point which features will be most useful based solely on the univariate plots above. All of the variables mentioned above were selected for the investigation because they are likely to have an impact on the borrower rate.
HasCreditGrade -> CreditGrade is available
HasProsperRating -> ProsperRating is available
HasIncome -> IncomeRange is available and greater than 0
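These three indicator variables could be derived roughly as follows. This is a sketch on a tiny made-up data frame; treating blank strings as "not available" and the exact set of no-income labels are my assumptions about how the columns are encoded.

```r
# Tiny made-up data frame; column names follow the ones mentioned in the text.
loans <- data.frame(
  CreditGrade   = c("A", "", "HR", ""),
  ProsperRating = c("", "B", "", "AA"),
  IncomeRange   = c("$0", "$25,000-49,999", "Not employed", "$100,000+"),
  stringsAsFactors = FALSE
)

# Blank strings are treated as "not available" (an assumption about the encoding).
loans$HasCreditGrade   <- loans$CreditGrade != ""
loans$HasProsperRating <- loans$ProsperRating != ""

# Income counts as available only for ranges that start above $0.
no_income <- c("", "$0", "Not employed", "Not displayed")
loans$HasIncome <- !(loans$IncomeRange %in% no_income)

print(loans)
```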
I log transformed StatedMonthlyIncome and DebtToIncomeRatio, which had right-skewed distributions (most values bunched at the low end, with a long tail of large values).
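A minimal sketch of the effect of such a transform, using synthetic log-normal incomes rather than the real StatedMonthlyIncome values. The `+ 1` offset mirrors the `log10(DebtToIncomeRatio + 1)` term used in the models later on; `skew_gap` is a hypothetical helper introduced only for this illustration.

```r
set.seed(42)
# Synthetic right-skewed incomes (log-normal), standing in for StatedMonthlyIncome.
income <- rlnorm(5000, meanlog = 8.4, sdlog = 0.6)

# Adding 1 before taking the log keeps zero values finite: log10(0 + 1) = 0.
log_income <- log10(income + 1)

# In a right-skewed variable the mean sits above the median;
# after the transform the two are much closer together.
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)
print(skew_gap(income))
print(skew_gap(log_income))
```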
ProsperScore..Alpha,
CreditGrade,
Incomerange,
LoanOrginationQuarter
The median BorrowerRate decreased significantly from 2012 to 2014. The wide variation in median BorrowerRate over time means it will be important to facet by quarter in the multivariate plots section.
## Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
## --------------------------------------------------------
## Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1274 0.1815 0.1935 0.2599 0.4975
## --------------------------------------------------------
## Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
1-year loans have a somewhat lower median rate (0.143) than 3-year (0.182) and 5-year (0.187) loans. Since most loans have a 3-year term (as observed in the univariate plots section), I will drop the 1- and 5-year loans.
In order to get a better picture, I’m going to
cut the LoanOriginalAmount variable into Quantiles.
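A minimal sketch of this binning step, on synthetic data rather than the Prosper loans. The choice of four bins (quartiles) and the variable names are assumptions for illustration.

```r
set.seed(7)
# Synthetic loans: smaller amounts are given higher rates on purpose,
# so the binned medians have a visible trend (not the real data).
n <- 2000
amount <- round(runif(n, 1000, 35000))
rate <- 0.30 - 5e-06 * amount + rnorm(n, sd = 0.03)

# Break points at the quartiles; include.lowest keeps the minimum in bin 1.
breaks <- quantile(amount, probs = seq(0, 1, by = 0.25))
amount_quantile <- cut(amount, breaks = breaks, include.lowest = TRUE,
                       labels = c("Q1", "Q2", "Q3", "Q4"))

# Median rate within each amount bin.
median_by_bin <- tapply(rate, amount_quantile, median)
print(median_by_bin)
```

Without `include.lowest = TRUE`, the smallest loan would fall outside the first interval and become `NA`.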
Smaller loan amounts have a significantly higher median borrower rate.
Again, I’m going to cut the DebtToIncomeRatio variable
into quantiles for a cleaner plot…
We can see from this graph that borrowers with a higher debt level relative to the others tend to have a higher borrower rate.
Both CreditGrade and CreditScore have an exceptionally high correlation to the BorrowerRate. Credit Grade is likely derived from the Credit Score.
This is verified in the plot below.
Prosper ratings are even more closely related to the BorrowerRate than credit grades. The Prosper Rating is likely a metric that the Prosper Loan Company uses to assess risk, based on other parameters.
Once again, I’m going to cut the StatedMonthlyIncome variable
into quantiles for a cleaner plot…
The median BorrowerRate varies consistently with the Quantile of StatedMonthlyIncome.
Not surprisingly, there is a similar variation in BorrowerRate amongst Borrowers with different Income Ranges.
Unemployed borrowers have a significantly higher median BorrowerRate than the others.
Homeowners have a lower median BorrowerRate than non-homeowners.
The worst states to get a loan in are Maine and Indiana; the best are Alabama and North Dakota. Maine (101 loans) and North Dakota (52 loans) have very few loans, so their rankings are less reliable.
In general, higher-paying occupations that require more education (e.g. Judge, Computer Programmer, Engineer) have a lower median borrower rate than lower-paying occupations that require less education (e.g. Teacher’s Aide, Nurse’s Aide, College Freshman).
The average Borrower Rate was decreasing significantly from 2012 to 2014.
The Quantile of DebtToIncomeRatio and Median BorrowerRate are correlated.
CreditScore has a significant correlation to BorrowerRate
CreditGrade is derived from Credit Score
Prosper Ratings are highly related to BorrowerRate.
The Median BorrowerRate goes down as IncomeRange goes up.
StatedMonthlyIncomeQuantile and the Median BorrowerRate are highly correlated.
Unemployed borrowers have a significantly higher Median BorrowerRate than the others.
Homeowners have a lower median BorrowerRate than non-homeowners.
The worst states to get a loan in are Maine and Indiana; the best are Alabama and North Dakota.
In general, higher-paying occupations that require more education (e.g. Judge, Computer Programmer, Engineer) have a lower median borrower rate than lower-paying occupations that require less education (e.g. Teacher’s Aide, Nurse’s Aide, College Freshman).
CreditGrade is derived from CreditScore.
ProsperRating determines the BorrowerRate almost perfectly, and is likely a score assigned by the Prosper Loan Company based on other parameters.
The strongest relationship was between BorrowerRate and ProsperRating; however, ProsperRating itself is likely composed of other parameters, as noted above.
Aside from this, the relationship between BorrowerRate and CreditScore was the second strongest.
In this section, my main objective is to see how the observed relationships from the previous section hold up and/or change over time.
To start off with, I will investigate the relationship between CreditGrade, Prosper Rating and Credit Score in more detail.
We can see from these two charts that the relationship between the Prosper rating and borrower rate and the relationship between the credit grade and borrower rate are fairly consistent over time.
An unintended but useful insight from the above visualizations is that the credit grade data goes up to Q2 2009, and the Prosper rating data ranges from Q3 2009 onwards. Given that both parameters are closely related to the BorrowerRate (as observed in the bivariate plots section), this suggests that the Prosper Loan Company used the borrower’s CreditGrade or credit score as the primary metric for determining the BorrowerRate up until Q2 2009, and used the Prosper Rating thereafter.
Next, let’s investigate how the correlation between BorrowerRate and CreditScoreRangeLower changes over time.
The correlation between credit score and borrower rate varies significantly over time. This means it may be useful to model each quarter separately, or make models for different segments of time.
In particular, the correlation goes down after Q3 2011.
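A per-quarter correlation of this kind can be computed by splitting the data on the quarter and calling `cor` within each piece. The sketch below uses synthetic data in which the link is deliberately weaker in the last quarter; the quarter labels and effect sizes are invented for illustration.

```r
set.seed(3)
# Synthetic data: the score/rate link is strong in two quarters and weak in
# the third. Quarter labels and effect sizes are invented for illustration.
quarters <- c("2011 Q1", "2011 Q2", "2011 Q4")
slope <- c(-6e-04, -6e-04, -1e-04)

loans <- do.call(rbind, lapply(seq_along(quarters), function(i) {
  score <- round(runif(300, 600, 850))
  data.frame(
    Quarter = quarters[i],
    CreditScoreRangeLower = score,
    BorrowerRate = 0.6 + slope[i] * score + rnorm(300, sd = 0.03)
  )
}))

# Pearson correlation computed separately within each quarter.
cor_by_quarter <- sapply(split(loans, loans$Quarter), function(d) {
  cor(d$BorrowerRate, d$CreditScoreRangeLower)
})
print(round(cor_by_quarter, 2))
```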
The bottoms of the plots for 2006 through 2010 have an upwards slant. It appears LoanOriginalAmount may play a role in the BorrowerRate at least up until Q2 2010.
Correlation between BorrowerRate and LoanOriginalAmount goes up in 2007 and back down around Q2 2013.
The relationship between IncomeRange and BorrowerRate appears to be much stronger after 2009. To get a clearer picture of this, the StatedMonthlyIncomeQuantile is plotted against the median borrower rate by quarter below.
It looks like Income is more closely correlated to the BorrowerRate after Q2 2009. This is verified in the chart below.
Correlation between StatedMonthlyIncome and BorrowerRate falls in Q2 2012
Next, we know from the previous DebtToIncomeRatioQuantile v. median BorrowerRate line plot that BorrowerRate and DebtToIncomeRatio are related. Let’s try to find where that relationship is particularly strong.
Correlation Between BorrowerRate and DebtToIncomeRatio is generally low, but goes up in Q2 2009, and back down in Q2 2011.
There appears to be a consistent pattern here: the methodology used to set the borrower rate seems to change around Q2 2009, when the Prosper rating went into effect, and again around Q2 2011.
##
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + CreditScoreRangeLower +
## log10(DebtToIncomeRatio + 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42386 -0.04944 -0.00974 0.04628 0.23322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.500e-01 2.465e-03 223.09 <2e-16 ***
## LoanOriginalAmount -2.710e-06 4.201e-08 -64.51 <2e-16 ***
## CreditScoreRangeLower -5.258e-04 3.708e-06 -141.79 <2e-16 ***
## log10(DebtToIncomeRatio + 1) 1.969e-01 3.529e-03 55.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06473 on 79907 degrees of freedom
## (103 observations deleted due to missingness)
## Multiple R-squared: 0.3203, Adjusted R-squared: 0.3202
## F-statistic: 1.255e+04 on 3 and 79907 DF, p-value: < 2.2e-16
The R^2 value for this fit is quite low (about 0.32).
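For readers who want to see where that number comes from, here is a sketch that fits a model with the same formula shape to synthetic data and reads off R^2. The data and coefficients are invented, so the printed value will not match the output above.

```r
set.seed(11)
n <- 1000
# Synthetic predictors and response; coefficients are invented, not the
# fitted values from the report.
sim <- data.frame(
  LoanOriginalAmount = round(runif(n, 1000, 35000)),
  CreditScoreRangeLower = round(runif(n, 600, 850)),
  DebtToIncomeRatio = rexp(n, rate = 4)
)
sim$BorrowerRate <- 0.55 - 3e-06 * sim$LoanOriginalAmount -
  5e-04 * sim$CreditScoreRangeLower +
  0.2 * log10(sim$DebtToIncomeRatio + 1) + rnorm(n, sd = 0.06)

# Same formula shape as in the output above.
fit <- lm(BorrowerRate ~ LoanOriginalAmount + CreditScoreRangeLower +
            log10(DebtToIncomeRatio + 1), data = sim)

# R^2 is the share of variance in BorrowerRate explained by the predictors.
r_squared <- summary(fit)$r.squared

# The same quantity by hand: 1 - residual sum of squares / total sum of squares.
manual <- 1 - sum(residuals(fit)^2) /
  sum((sim$BorrowerRate - mean(sim$BorrowerRate))^2)
print(c(r_squared, manual))
```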
Given the variation in the median borrower rate over time, as well as the variation over time in how strongly several variables correlate with the borrower rate, there may be better results if models are created for different periods of time.
We know the following from the previous plots:
Correlation between CreditScoreRangeLower and BorrowerRate goes down around Q1-Q2 2011.
Correlation between LoanOriginalAmount and BorrowerRate goes down around Q1-Q2 2011.
Correlation between StatedMonthlyIncome and BorrowerRate goes down around Q1 2011
Correlation between DebtToIncomeRatio and BorrowerRate goes up in Q2 2009 and back down in Q1 2011.
As such, it seems like good intervals to split the data would be from the beginning to Q2 2009, and from Q3 2009 through Q1 2011.
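One way to implement such a split is sketched below on synthetic data. `assign_period` is a hypothetical helper, and the `Year` and `Quarter` columns are assumed to exist; the simulated effect sizes are invented.

```r
# Hypothetical helper: label a loan's period from its origination year and quarter.
assign_period <- function(year, quarter) {
  index <- year * 4 + quarter  # sortable quarter index
  ifelse(index <= 2009 * 4 + 2, "through Q2 2009",
         ifelse(index <= 2011 * 4 + 1, "Q3 2009 to Q1 2011", "after Q1 2011"))
}

set.seed(5)
n <- 1500
# Synthetic loans; the score effect is invented and set to differ by period.
sim <- data.frame(
  Year = sample(2006:2013, n, replace = TRUE),
  Quarter = sample(1:4, n, replace = TRUE),
  CreditScoreRangeLower = round(runif(n, 600, 850))
)
sim$Period <- assign_period(sim$Year, sim$Quarter)
slope <- ifelse(sim$Period == "through Q2 2009", -4e-04, -1.2e-03)
sim$BorrowerRate <- 0.2 + slope * (sim$CreditScoreRangeLower - 700) +
  rnorm(n, sd = 0.04)

# Fit one model per period and collect each R^2.
fits <- lapply(split(sim, sim$Period), function(d) {
  lm(BorrowerRate ~ CreditScoreRangeLower, data = d)
})
r2_by_period <- sapply(fits, function(f) summary(f)$r.squared)
print(round(r2_by_period, 3))
```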
##
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + CreditScoreRangeLower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.287011 -0.031315 -0.007581 0.024863 0.215443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.115e-01 2.605e-03 234.75 <2e-16 ***
## LoanOriginalAmount 1.911e-06 6.238e-08 30.64 <2e-16 ***
## CreditScoreRangeLower -6.819e-04 4.213e-06 -161.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05183 on 26918 degrees of freedom
## Multiple R-squared: 0.5076, Adjusted R-squared: 0.5076
## F-statistic: 1.387e+04 on 2 and 26918 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount + log10(DebtToIncomeRatio +
## 1) + CreditScoreRangeLower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.196538 -0.051013 -0.004821 0.050731 0.237776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.020e+00 1.021e-02 99.948 < 2e-16 ***
## LoanOriginalAmount 1.661e-06 2.190e-07 7.585 3.69e-14 ***
## log10(DebtToIncomeRatio + 1) 1.471e-01 1.336e-02 11.010 < 2e-16 ***
## CreditScoreRangeLower -1.182e-03 1.484e-05 -79.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06755 on 8152 degrees of freedom
## (1013 observations deleted due to missingness)
## Multiple R-squared: 0.4953, Adjusted R-squared: 0.4951
## F-statistic: 2667 on 3 and 8152 DF, p-value: < 2.2e-16
Much better models resulted from the division of loans into separate periods. The log transform of DebtToIncomeRatio was included in the second model, but not in the first, since it did not improve the model. This makes sense, given that we saw that the correlation between DebtToIncomeRatio and BorrowerRate is generally higher after Q2 2009.
Both models benefited from LoanOriginalAmount as expected.
The inclusion of Stated Monthly Income did not have an effect on either model, so it was left out. This means that the observed relationship may have had more to do with generally better CreditScore or DebtToIncomeRatio for borrowers with higher income.
The average loan amount is closely correlated with the number of loans given out.
Credit score is correlated to Borrower Rate, although that correlation changes by quarter.
As of 2009, there is a strict cutoff at a credit score of 600.
There isn’t much variation in Borrower Rate by credit Score for those borrowers with a credit score of less than 600.
The correlation between the borrower rate and the loan amount becomes higher as the CreditGrade goes up.
The correlation between the borrower rate and the loan amount becomes higher as the Prosper Rating goes up.
A correlation between loan amount and borrower rate exists up until 2011.
The median borrower rate is related to IncomeRange.
The trend towards lower BorrowerRate for higher StatedMonthlyIncome is consistent across all of the quarters.
The trend towards higher BorrowerRate for higher DebtToIncomeRatio is consistent across all of the quarters.
The quantile of available monthly income is correlated to the median BorrowerRate.
CreditGrades were only used up until 2009, after which ProsperRatings were primarily used to determine BorrowerRate.
StatedMonthlyIncome varies significantly more across different ProsperRatings than it does across different CreditGrades.
(The models apply to a subset of the data taken from 2006 to 2009 with a CreditScoreRangeLower of 600 or less.)
R^2: 0.3203
R^2: 0.5076
R^2: 0.4953
The first model is the most general one, but it has a very low R^2 value and therefore does not represent the data very well.
The other two models, with higher R^2 values, do a much better job of modelling the data. Neither model, however, is capable of taking into account the fluctuations in the BorrowerRate over time due to economic trends.
This plot shows borrower occupation on the y axis in order of median borrower rate, with highest median borrower rate at the top. BorrowerRate is shown on the x axis.
In general, higher-paying occupations that require more education (e.g. Judge, Computer Programmer, Engineer, Professor) have a lower median borrower rate than lower-paying occupations that require less education (e.g. Teacher’s Aide, Nurse’s Aide, College Freshman, Clerical, Bus Driver).
While it is not clear whether the borrower’s occupation contributes directly to the borrower rate, it is interesting to see how the general prominence of the borrower’s occupation relates to the BorrowerRate.
This plot shows the median BorrowerRate, by quarter.
We can see from this plot that the median borrower rate rises in 2010 and falls in 2013. More generally, this plot demonstrates the variable nature of the BorrowerRate.
Given the relatively large number of loans, one might otherwise expect the median borrower rate to remain relatively stable over time. This variation means that the borrower rate is subject to economic trends over time, in addition to the characteristics of the borrower and the original loan terms.
## [1] "R^2:"
## [1] 0.5075882
## [1] "R^2:"
## [1] 0.4953332
The above plots show two linear models of the Borrower Rate, the first for loans up to Q2 2009, and the second for loans from Q3 2009 through Q1 2011.
This shows that variation in the BorrowerRate from Q1 2006 to Q2 2009 is explained primarily by the LoanOriginalAmount and the Credit Score for the borrowers with a credit score of greater than 0. Additionally, variation in the BorrowerRate from Q3 2009 through Q1 2011 is explained primarily by the LoanOriginalAmount, the Credit Score and the DebtToIncomeRatio.
The Prosper loan data set contains information on more than 100,000 loans from 2006 to 2014. My objective was to explore trends and factors contributing to the borrower rate. I started out by gaining an understanding of each of the variables in the investigation, and then went on to investigate the relationship between each variable and the borrower rate. I then explored the strongest relationships in further detail in order to get a better sense of when and where these relationships were strongest. Finally, I created a linear model using LoanOriginalAmount, CreditScoreRangeLower and DebtToIncomeRatio. This model was insufficient, so I created separate linear models for loans up to Q2 2009 and for loans from Q3 2009 through Q1 2011.
The biggest struggle was working out how the different variables related to the BorrowerRate over time. For example, DebtToIncomeRatio did not appear to be correlated at all to the BorrowerRate until the correlation was taken by quarter. Another example is that StatedMonthlyIncome seemed to have a relationship to BorrowerRate, but was found to be insignificant once it was applied in the linear model.
The model is limited primarily by the small number of parameters used in the investigation. The data set contains 81 parameters, of which several (in addition to those used) may have some influence on the BorrowerRate, including BorrowerRateCategory, Recommendations, investment from friends, and Investors. The model is also limited by the fluctuations in the median BorrowerRate over time. Further investigation could therefore use a wider array of variables to try to get a better model of the borrower rate, and could take into account contextual economic data to account for the general fluctuations in BorrowerRate over time. For example, I could separate the data by BorrowerRateCategory and try to explain the variation within each category, or I might get data on the national interest rate and use that as a parameter or multiplier in the models.